Introduction to R tidyverse

ATMDP-007: Environmental Data Science

Allan T. Souza

2024-01-24

Outline

R IDE

  • What is an IDE?

    • An Integrated Development Environment (IDE) is a software application that provides comprehensive facilities to computer programmers for software development. It usually includes a code editor, debugger, and build automation tools.
  • Which R IDE should I use?

    • The best IDE for R programming depends on our specific needs, preferences, and the features we require. There is no single right choice, but some IDEs are more popular than others among R users.

R IDE options

  • RStudio: RStudio is widely recognized as the most popular IDE for R programming. It’s a free, open-source IDE providing an extensive range of features like a code editor, debugger, console, and R Markdown support.

  • Jupyter Notebook: This web-based IDE is known for its data science and machine learning capabilities. It allows creating and sharing interactive documents with code, text, and visualizations.

  • Visual Studio Code: A free and open-source editor, VS Code is a popular choice for R programming. It offers features like syntax highlighting, code completion, and debugging.

  • ESS (Emacs Speaks Statistics): Combines the Emacs text editor with ESS package to provide R features like syntax highlighting and debugging.

  • Eclipse StatET: Integrates Eclipse IDE with the StatET plugin, offering features like a code editor and debugger for R programming.

  • Sublime Text: A lightweight and robust code editor for R programming, offering features like syntax highlighting and code completion.

  • RKWard: A free and open-source IDE designed specifically for R, with a user-friendly interface and features like a code editor and debugger.

  • PyCharm: Known for Python programming, PyCharm also supports R programming, offering advanced features.

tidyverse

What is tidyverse?

The tidyverse is a collection of R packages designed for data science that share an underlying design philosophy, grammar, and data structures. tidyverse packages provide a cohesive and coherent toolkit for data manipulation, exploration, and visualization that is designed to make data science faster and easier.

https://www.tidyverse.org/

Tidyverse core packages

  • ggplot2: For data visualization, using a layered grammar of graphics.

  • dplyr: For data manipulation, such as filtering rows, selecting columns, and summarizing data.

  • tidyr: For tidying data, changing the layout of datasets to a tidy format.

  • readr: For importing data, particularly from CSV and similar flat file formats.

  • purrr: For functional programming, enabling operations on lists and vectors.

  • tibble: A modern reimagining of data frames, keeping things simple and tidy.

  • stringr: For string manipulation and regular expressions.

  • forcats: For handling categorical variables (factors).

Other tidyverse packages

  • Importing data
    • RSQLite: Facilitates interaction with SQLite databases directly from R.
    • RMariaDB: Provides an interface to MariaDB and MySQL databases in R.
    • RPostgres: Offers tools for communicating with PostgreSQL databases using R.
    • odbc: Enables DBI-based access to databases through ODBC drivers in R.
    • haven: Used for importing and exporting data between R and SAS, SPSS, and Stata file formats.
    • httr2: Simplifies working with HTTP protocols to interact with web APIs in R.
    • readxl: Allows easy reading of Excel files (.xls and .xlsx) into R without external dependencies.
    • googlesheets4: Provides an interface to Google Sheets, enabling the retrieval and modification of sheets data in R.
    • googledrive: Designed to interact with Google Drive from R, allowing file management and access to Drive resources.
    • rvest: Aids in web scraping, making it easy to extract data from HTML web pages in R.
    • jsonlite: A robust and fast JSON parser and generator that simplifies JSON data manipulation in R.
    • xml2: Streamlines reading, writing, and parsing XML documents with R.
  • Wrangling data
    • lubridate: Simplifies working with dates and times in R, enhancing the datetime functionalities of base R (attached together with the core packages since tidyverse 2.0.0).
    • hms: Provides a simple class for storing and formatting time-of-day values, based on the difftime class.
    • blob: Introduces a simple S3 class for representing binary large objects (BLOBs) or raw vectors in R.
  • dplyr backends
    • dbplyr: A database backend for dplyr, allowing dplyr syntax to be used to manipulate data stored in a relational database, translating dplyr code into SQL.
    • dtplyr: Provides a data.table backend for dplyr, enabling the use of dplyr syntax while leveraging the speed and efficiency of data.table for large datasets.
  • Programming
    • magrittr: Introduces the pipe operator %>%, which allows for cleaner and more readable code by enabling the chaining of commands in a sequence of data operations.
    • glue: Provides an easy-to-use interface for constructing strings with embedded expressions, using a syntax that is both concise and flexible, ideal for creating dynamic outputs and SQL queries.

Installing and using tidyverse

Installing

  • Installing tidyverse is easy: we can do it directly from our console, since the whole suite of packages is available on CRAN.
install.packages("tidyverse")

Using

  • After installing tidyverse on our computer, we must load it in every new R session.
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
  • Note that some conflicts with other packages might emerge (this is not specific to tidyverse, but a consequence of how R attaches packages).

    • This happens when more than one function with the same name is loaded in our R environment; the most recently attached package masks the others.

    • One way around this is to prefix the function with the name of its package, separated by ::, e.g. stats::filter().
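The :: prefix can be sketched as follows. The temps data frame and its values are made up for illustration; the point is only that the two functions named filter() do entirely different jobs, and the prefix makes the choice explicit.

```r
library(dplyr)

# Hypothetical example data: five daily temperature readings
temps <- data.frame(day = 1:5, temp = c(2, 3, 6, 7, 4))

# dplyr::filter() keeps the rows that satisfy a condition
warm_days <- dplyr::filter(temps, temp > 3)

# stats::filter() applies a linear filter to a series,
# here a centred 3-point moving average
moving_avg <- stats::filter(temps$temp, rep(1 / 3, 3))

warm_days
moving_avg
```

Writing the prefix also works without calling library() first, as long as the package is installed.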

Advantages of tidyverse

  • tidyverse offers a significant advantage over base R primarily through its consistent and user-friendly syntax, making data manipulation and analysis more intuitive and accessible, especially for beginners.

  • Its collection of packages is designed to work together seamlessly, streamlining workflows in data science.

  • This integration reduces the learning curve and enhances productivity, allowing for more readable and maintainable code.

  • tidyverse’s emphasis on tidy data principles aids in creating more organized and understandable data structures, facilitating easier data analysis and visualization.

(Tidy data)

small parenthesis

Tidy data

Tidy data is a concept and format in data preparation that simplifies data analysis in statistics and data science. It adheres to three main principles.

Tidy data

Sample ID      Treatment  Soil temperature
HYY-C-2024-01  Control    2
HYY-C-2024-02  Control    3
HYY-C-2024-03  Drought    6
HYY-C-2024-04  Drought    7
  • Each row represents an observation (the soil temperature measured for one Sample ID under one treatment).
  • Each column represents a variable (Sample ID, Treatment, Soil temperature).
  • Each cell holds a single value.

Non-tidy data

Sample ID      Control  Drought
HYY-C-2024-01  2        NA
HYY-C-2024-02  3        NA
HYY-C-2024-03  NA       6
HYY-C-2024-04  NA       7
  • The treatment levels (Control, Drought) are spread across multiple columns.
  • The column headers hold values of the treatment variable while the cells hold soil temperature, so two variables are mixed in one layout and half of the cells are NA.
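As a rough sketch, the non-tidy table above can be reshaped into the tidy layout with pivot_longer() from tidyr. The object and column names (soil_wide, sample_id, treatment, soil_temperature) are assumptions chosen for this illustration.

```r
library(tidyverse)

# The non-tidy table, typed in by hand (column names are illustrative)
soil_wide <- tribble(
  ~sample_id,      ~Control, ~Drought,
  "HYY-C-2024-01", 2,        NA,
  "HYY-C-2024-02", 3,        NA,
  "HYY-C-2024-03", NA,       6,
  "HYY-C-2024-04", NA,       7
)

# Gather the two treatment columns into one 'treatment' column and
# one 'soil_temperature' column, dropping the empty (NA) cells
soil_tidy <- soil_wide %>%
  pivot_longer(cols = c(Control, Drought),
               names_to = "treatment",
               values_to = "soil_temperature",
               values_drop_na = TRUE)

soil_tidy
```

The result has one row per sample and one column per variable, matching the tidy table shown earlier.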

tidyverse syntax

  • The magrittr package and its pipe operator %>% play a crucial role in the tidyverse syntax by enabling a more intuitive and readable flow of data manipulation steps.

  • This operator allows for chaining together functions in a sequence, transforming the data step-by-step.

  • This approach not only enhances readability but also simplifies the process of writing and understanding complex data transformations.

pipe operator

1. Enhanced Readability and Clarity:

  • Sequential Operations: The pipe operator allows for a sequence of operations to be chained together. This leads to code that reads more like a series of steps, which aligns closely with the way we logically think about data processing tasks.

  • Reduction in Nesting: Without the pipe operator, functions are nested inside each other, which can make code difficult to read and understand. The pipe operator reduces this nesting, making the code cleaner and more straightforward.

2. Easier Debugging and Maintenance:

  • Modifying Code: When using the pipe operator, it’s easier to add, remove, or change steps in our data processing pipeline. This flexibility makes debugging and maintaining code simpler.

  • Troubleshooting: We can insert a breakpoint or a diagnostic function at any point in the pipeline to inspect intermediate results, which helps in troubleshooting.

pipe operator

3. Encouraging Good Programming Practices:

  • Modular Approach: The pipe operator encourages a modular approach to code writing. Each step in the pipeline does one thing, which is a good programming practice. This modularity also makes the code more reusable.

  • Focus on Data Flow: The pipe operator emphasizes the flow of data through a series of transformations, which aligns well with many data analysis tasks.

4. Synergy with Tidyverse Philosophy:

  • Consistency with Tidyverse: The pipe operator is part of the tidyverse’s coherent and consistent approach to data science. It works seamlessly with other tidyverse packages (like dplyr and tidyr), which are designed to work with pipe-friendly syntax.

  • Functional Style: The pipe operator supports a more functional style of programming, where the focus is on the transformation of data rather than the manipulation of state.

pipe operator

5. Gentler learning curve for beginners:

  • Intuitive for New Users: For those new to R, the pipe operator can make learning easier. The clear, step-by-step nature of piped commands is often more intuitive than nested function calls.

  • Alignment with Natural Language: The pipe operator’s syntax is somewhat analogous to natural language (“Take this data, then do this, then do that”), which can be easier for beginners to grasp.
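To make the readability argument concrete, here is a small sketch that writes the same three steps twice, once nested and once piped. It uses the starwars dataset shipped with dplyr; the particular steps (keeping droids, choosing two columns, sorting by height) are arbitrary choices for illustration.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- arrange(select(filter(starwars, species == "Droid"), name, height),
                  desc(height))

# Piped calls: read top to bottom, "take this data, then do this, then do that"
piped <- starwars %>%
  filter(species == "Droid") %>%
  select(name, height) %>%
  arrange(desc(height))

# Both versions produce the same tibble
all.equal(nested, piped)
```

To inspect an intermediate result in the piped version, it is enough to run the pipeline only up to the step of interest.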

pipe operator

Basic piping

  • x %>% f is equivalent to f(x)

  • x %>% f(y) is equivalent to f(x, y)

  • x %>% f %>% g %>% h is equivalent to h(g(f(x)))
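These equivalences can be checked directly in the console. The sketch below uses an arbitrary numeric vector and base functions standing in for f, g, and h.

```r
library(magrittr)

x <- c(2.25, 6.25, 12.25)

# x %>% f is equivalent to f(x)
same_single <- identical(x %>% sqrt, sqrt(x))

# x %>% f(y) is equivalent to f(x, y)
same_with_arg <- identical(x %>% round(1), round(x, 1))

# x %>% f %>% g %>% h is equivalent to h(g(f(x)))
same_chain <- identical(x %>% sqrt %>% sum %>% round, round(sum(sqrt(x))))

c(same_single, same_with_arg, same_chain)
```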

pipe operator

Example of usage

# loading tidyverse
library(tidyverse)
  • Checking an example dataset
# starwars is an example dataset shipped with dplyr, so it is available once tidyverse is loaded
starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

pipe operator

Example of usage

  • Let’s take a look at the dataset.
    • glimpse() is a function available from the dplyr package and is like a transposed version of print().
    • The data object (starwars) precedes the operation (glimpse()), and it is connected through %>%.
# taking a look at the dataset
starwars %>%
  glimpse()
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

pipe operator

Example of usage

  • We obtain the same result if we do not use the pipe operator.
  • The function (glimpse()) comes first, and the data object (starwars) is nested in it.
# taking a look at the dataset
glimpse(starwars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

tidyverse core packages

Basic core packages

  • Foundation for data science

    • tibble: modern take on data frames, ensuring ease of use and compatibility with the tidyverse ecosystem.

    • dplyr: key for data manipulation tasks, offering intuitive functions for filtering, sorting, and summarizing data efficiently.

    • tidyr: essential for data tidying, transforming datasets into a structured, readable format.

  • Why start with these?

    • Practical relevance: these packages address the most common data processing tasks - organizing, transforming, and summarizing data.

    • Easy to learn: mastering these packages provides a strong foundation, making it easier to understand and utilize other tidyverse packages.

    • Immediate application: skills in tibble, dplyr, and tidyr are immediately applicable in a wide range of data processing and analysis scenarios.

  • Building a Strong Base

    • Focusing on these packages first provides understanding of essential tools needed for most data processing tasks.

    • Encourages a smoother transition to more complex aspects of data analysis in the tidyverse.

tibble

  • Why tibble?

    • Enhanced Data Frames: tibbles are an evolution of the traditional data frame in R, offering a more modern, tidyverse-compatible structure.

    • User-Friendly: easier to use and understand, especially for those new to R.

  • tibble vs. data.frame:

    • Printing: tibbles print only the first 10 rows and the columns that fit on screen, together with the type of each column, which keeps large datasets manageable.

    • Data type preservation: tibbles never convert character vectors to factors. data.frame() did so by default before R 4.0.0 (stringsAsFactors = TRUE).

    • Subsetting behavior: tibbles are stricter and more consistent; for instance, $ never partially matches column names.

    • Row names: tibbles do not use row names, which simplifies their structure and avoids some common data manipulation errors.

    • Column subsetting: [ always returns a tibble, even with a single column, whereas a data.frame might return a vector.

    • Non-syntactic names: tibble() keeps non-syntactic column names (e.g. names containing spaces) unchanged, whereas data.frame() rewrites them by default. Backticks are still needed to refer to such columns in code.
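A minimal sketch of the subsetting and naming behaviour, using a tiny made-up dataset (the names df, tb, and the column soil temp are assumptions for illustration):

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Single-column subsetting with [ : the data.frame is simplified to a vector,
# the tibble stays a tibble
class(df[, "x"])
class(tb[, "x"])

# Non-syntactic names: data.frame() rewrites them by default, tibble() keeps them
names(data.frame(`soil temp` = 1))
names(tibble(`soil temp` = 1))

# Backticks are still needed to refer to such a column in code
tibble(`soil temp` = 1)$`soil temp`
```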

tibble

A few relevant tibble functions

  • as_tibble(): Converts existing data structures into tibbles.

  • tibble(): Creates tibble data frames directly.

  • tribble(): Allows for easy creation of tibbles with a readable layout.

  • add_row(): Adds rows to an existing tibble.

  • add_column(): Adds columns to an existing tibble.

Note that the tibble package contains more functions that might be relevant, depending on our needs.

tibble

Using the pipe operator (%>%) to connect processing steps, we can transform a dataset into a tibble and add a new row in one go.

library(tidyverse)
# Convert iris to a tibble and add a new row using pipe operator
iris %>% # this is a classic default dataset in R 
  as_tibble() %>% # transforming the data.frame to tibble
  add_row(Sepal.Length = 5.5, 
          Sepal.Width = 3, 
          Petal.Length = 1.2, 
          Petal.Width = 0.1, 
          Species = "cherry") %>% # adding a new row in the dataset
  tail() # to check the row added
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
         <dbl>       <dbl>        <dbl>       <dbl> <chr>    
1          6.7         3            5.2         2.3 virginica
2          6.3         2.5          5           1.9 virginica
3          6.5         3            5.2         2   virginica
4          6.2         3.4          5.4         2.3 virginica
5          5.9         3            5.1         1.8 virginica
6          5.5         3            1.2         0.1 cherry   

tibble

Alternatively, we could use base R functions to do the same job.

# Convert iris to a data frame (it's already a data frame, so this step is more about clarity)
iris_df <- data.frame(iris)

# Extend the levels of the Species factor to include "cherry"
iris_df$Species <- factor(iris_df$Species, levels = c(levels(iris_df$Species), "cherry"))

# Adding a new row to the data frame
new_row <- data.frame(Sepal.Length = 5.5, 
                      Sepal.Width = 3, 
                      Petal.Length = 1.2, 
                      Petal.Width = 0.1, 
                      Species = "cherry")
iris_df <- rbind(iris_df, new_row)

# Display the last few rows of the data frame to check the row added
tail(iris_df)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
151          5.5         3.0          1.2         0.1    cherry

Note that the code is more complex (e.g. the new level must first be added to Species) and the output does not show the data classes (e.g. dbl, chr, fct).

tibble

  • Creating a tibble from scratch using tribble()

    • easy step-by-step dataset building

    • entries are written row by row

    • good for small datasets

    • column names are prefixed with ~ and values are separated by ,

library(tibble)
# Creating a tribble
felines <- tribble(
  ~species, ~weight, ~length,
  "lion",   190,     2.4,
  "tiger",  220,     2.5,
  "jaguar", 100,     2.0
)

felines
# A tibble: 3 × 3
  species weight length
  <chr>    <dbl>  <dbl>
1 lion       190    2.4
2 tiger      220    2.5
3 jaguar     100    2  

tibble

  • Creating a data.frame from scratch using base R.

    • the information is entered column by column, not row by row

    • harder to follow in slightly larger datasets

# Creating a data frame in base R
felines <- data.frame(
  species = c("lion", "tiger", "jaguar"),
  weight = c(190, 220, 100),
  length = c(2.4, 2.5, 2.0)
)

felines
  species weight length
1    lion    190    2.4
2   tiger    220    2.5
3  jaguar    100    2.0

tibble

library(tidyverse) # provides the tibble functions and the %>% pipe
felines <- tribble(
  ~species, ~weight, ~length,
  "lion",   190,     2.4,
  "tiger",  220,     2.5,
  "jaguar", 100,     2.0
)

felines %>%
  add_row(species = "leopard", # Adding a row for the leopard
          weight = 90, 
          length = 2.1) %>% 
  add_column(scientific_name = c("Panthera leo", "Panthera tigris", "Panthera onca", "Panthera pardus"), # Adding a column for the scientific name
             .after = "species") #specifying where the new column will be added
# A tibble: 4 × 4
  species scientific_name weight length
  <chr>   <chr>            <dbl>  <dbl>
1 lion    Panthera leo       190    2.4
2 tiger   Panthera tigris    220    2.5
3 jaguar  Panthera onca      100    2  
4 leopard Panthera pardus     90    2.1

tibble

  • Adding rows and columns to a data.frame using base R.
    • More complex code and less intuitive workflow.
# Creating a data.frame
felines <- data.frame(species = c("lion", "tiger", "jaguar"),
                      weight = c(190, 220, 100),
                      length = c(2.4, 2.5, 2.0))

# Adding a row for the leopard
# (passed as a one-row data.frame so that weight and length stay numeric;
# a plain c("leopard", 90, 2.1) would turn both columns into character)
felines <- rbind(felines, data.frame(species = "leopard", weight = 90, length = 2.1))

# Adding a column for the scientific name
felines$scientific_name <- c("Panthera leo", "Panthera tigris", "Panthera onca", "Panthera pardus")

# Reordering columns to place scientific_name after species
felines <- felines[c("species", "scientific_name", "weight", "length")]

felines
  species scientific_name weight length
1    lion    Panthera leo    190    2.4
2   tiger Panthera tigris    220    2.5
3  jaguar   Panthera onca    100    2.0
4 leopard Panthera pardus     90    2.1

dplyr

  • dplyr empowers users to efficiently handle and transform data, making it a vital tool for any R data processing and manipulation task.

  • Probably the most used, and arguably the most important, of all tidyverse core packages.

    • Some relevant dplyr functions:

      • filter(): Extracts rows based on specified conditions.

      • select(): Chooses columns, simplifying dataset structure.

      • rename(): Changes the names of individual variables.

      • mutate(): Creates or transforms variables, enhancing data with new insights.

      • group_by(): Facilitates grouped calculations, enhancing data analysis scope.

      • summarise(): Aggregates data, ideal for generating summaries.

      • *_join(): Merges two datasets based on a common key. The available variants include full_join(), inner_join(), left_join(), and right_join().

      • arrange(): Orders rows by variable values.

      • distinct(): Keeps only unique/distinct rows from a data frame.

      • relocate(): Changes the order of the columns in the dataset.

      • if_else(): A vectorized if-else function. Similar to, but stricter than, base ifelse(): both branches must have compatible types.

      • case_when(): A vectorized set of if-else statements.

Note that other functions from dplyr might be more relevant depending on our specific needs.
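The difference in strictness between if_else() and base ifelse() can be sketched as follows. The wind values and the threshold of 64 are made up for illustration.

```r
library(dplyr)

# Hypothetical wind speeds, including one missing value
wind <- c(50, 80, NA)

# Base ifelse() silently coerces mixed types: the result is a character vector
loose <- ifelse(wind > 64, "hurricane", 0)

# if_else() refuses to mix a character branch with a numeric branch
strict <- tryCatch(
  if_else(wind > 64, "hurricane", 0),
  error = function(e) "error: incompatible types"
)

# With consistent types it works, and `missing` handles NA explicitly
labelled <- if_else(wind > 64, "hurricane", "not a hurricane", missing = "unknown")

loose
strict
labelled
```

Failing early on mismatched types is usually preferable in a pipeline, because a silently coerced column tends to cause harder-to-trace problems several steps later.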

dplyr

With dplyr it is possible to chain a series of data processing steps without saving intermediate objects in R. Each step is connected to the next with the pipe operator %>%, and the code is easy to follow.

library(tidyverse)

# Using the storms dataset
storms %>%
  filter(year >= 2020) %>% # Keep only storms from 2020 onwards
  select(name, year, status, wind) %>% # Choose specific columns
  group_by(name, year) %>% # Grouping by name and year
  summarise(average_wind = mean(wind, na.rm = TRUE), .groups = "drop") %>% # calculating the average wind speed 
  rename(average_wind_speed = average_wind) %>% # Change 'average_wind' to 'average_wind_speed'
  arrange(desc(average_wind_speed)) %>% # Order by wind speed in descending order
  relocate(year, .before = name) # Move 'year' column right before 'name'
# A tibble: 66 × 3
    year name    average_wind_speed
   <dbl> <chr>                <dbl>
 1  2021 Sam                   93.2
 2  2021 Larry                 82.2
 3  2022 Ian                   70.5
 4  2020 Teddy                 70.3
 5  2022 Fiona                 69.2
 6  2020 Delta                 67.7
 7  2020 Iota                  63.7
 8  2020 Zeta                  59.2
 9  2020 Isaias                57.6
10  2020 Epsilon               56.3
# ℹ 56 more rows

dplyr

It is also possible to achieve the same result using base R syntax and functions. However, the code is less clear and requires saving several intermediate objects in the R environment.

library(tidyverse) # to access the storms dataset

# Filtering storms from 2020 onwards and selecting specific columns
storms_filtered <- subset(storms, year >= 2020, select = c(name, year, status, wind))

# Grouping by name and year, and calculating the average wind speed
storms_aggregated <- aggregate(wind ~ name + year, data = storms_filtered, 
                               FUN = function(x) mean(x, na.rm = TRUE))

# Renaming 'wind' to 'average_wind_speed'
names(storms_aggregated)[which(names(storms_aggregated) == "wind")] <- "average_wind_speed"

# Ordering by average wind speed in descending order
storms_ordered <- storms_aggregated[order(-storms_aggregated$average_wind_speed), ]

# Moving 'year' column right before 'name'
storms_ordered <- storms_ordered[c("year", "name", "average_wind_speed")]

# View the final result
storms_ordered
   year      name average_wind_speed
48 2021       Sam           93.22034
42 2021     Larry           82.17391
60 2022       Ian           70.50000
25 2020     Teddy           70.30612
57 2022     Fiona           69.18033
5  2020     Delta           67.74194
14 2020      Iota           63.65385
30 2020      Zeta           59.20000
15 2020    Isaias           57.63889
8  2020   Epsilon           56.34146
55 2022      Earl           55.57692
64 2022    Martin           55.23810
9  2020       Eta           54.82759
52 2022    Bonnie           54.27273
39 2021       Ida           53.62500
24 2020     Sally           52.32143
18 2020     Laura           52.26190
35 2021      Elsa           51.04651
37 2021     Grace           50.78947
13 2020     Hanna           50.27778
54 2022  Danielle           48.70968
22 2020  Paulette           48.01136
20 2020      Nana           46.92308
65 2022    Nicole           46.15385
38 2021     Henri           44.52381
61 2022     Julia           44.28571
1  2020    Arthur           43.61111
50 2021     Wanda           43.61111
63 2022      Lisa           42.70833
51 2022      Alex           42.64706
27 2020     Theta           42.12121
40 2021    Julian           41.92308
11 2020     Gamma           41.42857
19 2020     Marco           41.42857
58 2022    Gaston           41.17647
45 2021    Odette           39.41860
31 2021       Ana           39.41176
17 2020      Kyle           39.28571
49 2021    Victor           37.72727
46 2021     Peter           37.50000
4  2020 Cristobal           37.22222
12 2020   Gonzalo           37.17391
44 2021  Nicholas           36.66667
3  2020      Beta           36.02941
62 2022      Karl           35.83333
32 2021      Bill           35.50000
33 2021 Claudette           34.56522
43 2021     Mindy           34.50000
34 2021     Danny           33.57143
53 2022     Colin           33.33333
7  2020   Edouard           33.18182
16 2020 Josephine           33.12500
28 2020     Vicky           32.91667
29 2020   Wilfred           32.64706
47 2021      Rose           32.50000
36 2021      Fred           32.44444
2  2020    Bertha           32.14286
23 2020      Rene           31.87500
6  2020     Dolly           31.66667
21 2020      Omar           30.74074
41 2021      Kate           30.71429
66 2022    Twelve           30.00000
59 2022   Hermine           29.58333
10 2020       Fay           28.51852
26 2020       Ten           27.50000
56 2022    Eleven           25.83333

dplyr

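The next example combines further dplyr functions (mutate(), case_when(), if_else(), distinct(), and count()) in a more advanced processing task. Note that the thresholds used in the code are the Saffir-Simpson limits expressed in mph, whereas the storms dataset documents wind in knots, so the resulting categories should be read as illustrative.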

library(tidyverse)

storms %>%
  mutate(wind_category = case_when( # Create a new column 'wind_category' based on wind speed using case_when
    wind < 74 ~ "Not a hurricane",
    wind >= 74 & wind < 96 ~ "Category 1",
    wind >= 96 & wind < 111 ~ "Category 2",
    wind >= 111 & wind < 130 ~ "Category 3",
    wind >= 130 & wind < 157 ~ "Category 4",
    TRUE ~ "Category 5"
  )) %>%
  mutate(major_hurricane = if_else(condition = wind_category %in% c("Category 3", "Category 4", "Category 5"), 
                                   true = "Yes", false = "No")) %>% # Create a new column 'major_hurricane' using if_else to identify major hurricanes (Category 3 and above)
  filter(major_hurricane == "Yes") %>% # Filter to keep only rows where 'major_hurricane' is "Yes"
  distinct(name, year, major_hurricane) %>% # Keep only distinct rows based on 'name', 'year', and 'major_hurricane'
  count(year, major_hurricane, sort = TRUE, name = "major_hurricane_per_year") %>% # Count the number of major hurricanes per year and sort the result
  select(-major_hurricane) # Remove the 'major_hurricane' column from the final result
# A tibble: 37 × 2
    year major_hurricane_per_year
   <dbl>                    <int>
 1  1999                        5
 2  2005                        5
 3  2020                        5
 4  2004                        4
 5  2008                        4
 6  2010                        4
 7  2017                        4
 8  1988                        3
 9  1995                        3
10  1978                        2
# ℹ 27 more rows

dplyr

Again, it is possible to achieve the same end goal using base R and stats functions, but the code is less straightforward and saves intermediate data objects in the R environment.

library(tidyverse) # to have access to the storms dataset

# Create a new column 'wind_category' based on wind speed
storms$wind_category <- with(storms, ifelse(wind < 74, "Not a hurricane",
                           ifelse(wind < 96, "Category 1",
                           ifelse(wind < 111, "Category 2",
                           ifelse(wind < 130, "Category 3",
                           ifelse(wind < 157, "Category 4", "Category 5"))))))

# Create a new column 'major_hurricane' to identify major hurricanes
storms$major_hurricane <- ifelse(test = storms$wind_category %in% c("Category 3", "Category 4", "Category 5"),
                                yes = "Yes", no = "No")

storms_unique <- storms[!duplicated(storms[c("name", "year", "major_hurricane")]), ] # Keep only distinct rows based on 'name', 'year', and 'major_hurricane'

major_hurricanes <- storms_unique[storms_unique$major_hurricane == "Yes", ] # Filter to keep only major hurricanes

major_hurricane_count <- aggregate(cbind(major_hurricane_per_year = major_hurricanes$wind) ~ year, # Count the number of major hurricanes per year
                                   data = major_hurricanes, FUN = length)

major_hurricane_count <- major_hurricane_count[order(-major_hurricane_count$major_hurricane_per_year), ] # Sorting by count in descending order

head(major_hurricane_count) # View the result
   year major_hurricane_per_year
17 1999                        5
23 2005                        5
35 2020                        5
22 2004                        4
25 2008                        4
27 2010                        4

dplyr

  • Joining datasets using dplyr:
    • First, we create a small lookup dataset using tribble().

    • Then we join it to storms using the different *_join() functions from dplyr.

library(tidyverse)

# Create a small custom dataset using tribble
storm_categories <- tribble(
  ~category,      ~description,
  "Category 1",   "Very dangerous winds",
  "Category 2",   "Extremely dangerous winds",
  "Category 3",   "Devastating damage",
  "Category 4",   "Catastrophic damage",
  "Category 5",   "High chance of being deadly")

storms <- storms %>% # Add a wind category to the storms dataset for joining
  mutate(wind_category = case_when(
    wind < 74 ~ "Not a hurricane",
    wind >= 74 & wind < 96 ~ "Category 1",
    wind >= 96 & wind < 111 ~ "Category 2",
    wind >= 111 & wind < 130 ~ "Category 3",
    wind >= 130 & wind < 157 ~ "Category 4",
    TRUE ~ "Category 5"))

Full Join: Include all rows from both datasets

full_join_result <- full_join(storms, storm_categories, by = c("wind_category" = "category"))
full_join_result %>% 
  select(name, year, wind_category, description) %>% 
  slice(1:3)
# A tibble: 3 × 4
  name   year wind_category   description
  <chr> <dbl> <chr>           <chr>      
1 Amy    1975 Not a hurricane <NA>       
2 Amy    1975 Not a hurricane <NA>       
3 Amy    1975 Not a hurricane <NA>       

Inner Join: Include only rows with matching categories

inner_join_result <- inner_join(storms, storm_categories, by = c("wind_category" = "category"))
inner_join_result %>% 
  select(name, year, wind_category, description) %>% 
  slice(1:3)
# A tibble: 3 × 4
  name      year wind_category description              
  <chr>    <dbl> <chr>         <chr>                    
1 Blanche   1975 Category 1    Very dangerous winds     
2 Blanche   1975 Category 1    Very dangerous winds     
3 Caroline  1975 Category 2    Extremely dangerous winds

dplyr

Right Join: Include all rows from storm_categories and only matching rows from storms

right_join_result <- right_join(storms, storm_categories, by = c("wind_category" = "category")) 
right_join_result %>% 
  select(name, year, wind_category, description) %>% 
  slice(1:3)
# A tibble: 3 × 4
  name      year wind_category description              
  <chr>    <dbl> <chr>         <chr>                    
1 Blanche   1975 Category 1    Very dangerous winds     
2 Blanche   1975 Category 1    Very dangerous winds     
3 Caroline  1975 Category 2    Extremely dangerous winds

Left Join: Include all rows from storms and only matching rows from storm_categories

left_join_result <- left_join(storms, storm_categories, by = c("wind_category" = "category"))
left_join_result %>% 
  select(name, year, wind_category, description) %>% 
  slice(1:3)
# A tibble: 3 × 4
  name   year wind_category   description
  <chr> <dbl> <chr>           <chr>      
1 Amy    1975 Not a hurricane <NA>       
2 Amy    1975 Not a hurricane <NA>       
3 Amy    1975 Not a hurricane <NA>       

dplyr

| dplyr | merge | Description |
|---|---|---|
| full_join() | merge(..., all = TRUE) | This performs a full outer join, combining all rows from both datasets. When there’s no match in one dataset, NA values are introduced in the resulting dataset. |
| inner_join() | merge(..., all = FALSE) | This conducts an inner join, returning only the rows with matching values in both datasets. Rows without a corresponding match in either dataset are excluded. |
| left_join() | merge(..., all.x = TRUE) | This performs a left outer join, retaining all rows from the first dataset and matching rows from the second dataset. NA values are filled in where the second dataset has no match. |
| right_join() | merge(..., all.y = TRUE) | This executes a right outer join, keeping all rows from the second dataset and matching rows from the first dataset. Rows in the second dataset without a match in the first dataset are filled with NA in the resulting dataset. |
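
As a minimal sketch of the correspondence above, the four joins can be compared on a toy example. The objects obs and lookup are hypothetical and exist only for illustration; note that merge() needs by.x and by.y when the key columns have different names, and that it returns a plain data.frame sorted by the key.

```r
library(tidyverse)

# Hypothetical toy data (illustrative names, not part of the lecture datasets)
obs <- tribble(
  ~id, ~wind_category,
  1,   "Category 1",
  2,   "Not a hurricane",
  3,   "Category 2",
  4,   "Category 1")

lookup <- tribble(
  ~category,    ~description,
  "Category 1", "Very dangerous winds",
  "Category 2", "Extremely dangerous winds",
  "Category 3", "Devastating damage")

key <- c("wind_category" = "category")

# dplyr left join and its base R counterpart
left_dplyr <- left_join(obs, lookup, by = key)
left_base  <- merge(obs, lookup, by.x = "wind_category", by.y = "category", all.x = TRUE)

# Number of rows returned by each join type
join_rows <- c(
  full  = nrow(full_join(obs, lookup, by = key)),
  inner = nrow(inner_join(obs, lookup, by = key)),
  left  = nrow(left_dplyr),
  right = nrow(right_join(obs, lookup, by = key)))

join_rows
```

The full join keeps the unmatched rows from both sides, while the inner join keeps only the rows whose category appears in both datasets.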

tidyr

tidyr

tidyr is a handy package for data processing and manipulation: it transforms messy data into a structured, tidy format.

  • Key Solutions:

    • Handling Messy Data: Streamlines the process of cleaning and organizing data, making it compatible with other tidyverse packages.

    • Data Transformation: Provides tools for converting between wide and long formats, handling missing values, and separating or uniting columns.

  • Some relevant tidyr functions:

    • pivot_longer(): Transforms data from wide to long format, making it easier to analyze with other tidyverse tools.

    • pivot_wider(): Converts data from long to wide format, useful for creating human-readable tables.

    • separate_wider_delim(): Splits a single column into multiple columns, ideal for unpacking complex fields.

    • unite(): Combines multiple columns into a single column, simplifying datasets with redundant columns.

    • drop_na(): Removes rows with missing values, streamlining datasets for analysis.

    • replace_na(): Substitutes NA values with specified replacements, maintaining data integrity.

*Note that other functions from tidyr might be more relevant depending on our specific needs.
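
The two missing-value helpers listed above can be sketched on a small toy table. The readings object and its columns are hypothetical and only serve to show how the functions behave.

```r
library(tidyverse)

# Hypothetical toy data (not part of the storms dataset)
readings <- tribble(
  ~station, ~temp, ~rain,
  "A",       12.5,   NA,
  "B",         NA,  3.2,
  "C",       15.1,  0.0)

# drop_na() without column names drops rows with an NA in any column
complete_rows <- readings %>% drop_na()

# drop_na(temp) only checks the 'temp' column
temp_known <- readings %>% drop_na(temp)

# On a data frame, replace_na() takes a named list with one replacement per column
rain_filled <- readings %>% replace_na(list(rain = 0))

rain_filled
```

Whether a missing value should be dropped or replaced depends on what the NA means in the data at hand; replacing missing rainfall with 0 is only an assumption made for this sketch.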

tidyr

Transform storms dataset into a wider format using pivot_wider().

library(tidyverse)
storms_wider <- storms %>%
  filter(year >= 2013) %>% 
  select(year, status, wind) %>%
  group_by(year, status) %>% 
  summarise(max_wind_speed = max(wind), .groups = "drop") %>% 
  pivot_wider(names_from = "status", values_from = "max_wind_speed")

storms_wider
# A tibble: 10 × 10
    year disturbance extratropical hurricane `other low` subtropical depressio…¹
   <dbl>       <int>         <int>     <int>       <int>                   <int>
 1  2013          35            45        80          50                      30
 2  2014          25            65       125          40                      NA
 3  2015          NA            65       135          55                      NA
 4  2016          NA            70       145          45                      NA
 5  2017          40            75       155          45                      30
 6  2018          45            75       140          40                      30
 7  2019          50            75       160          80                      NA
 8  2020          40            75       135          45                      30
 9  2021          40            70       135          40                      NA
10  2022          40           100       140          55                      NA
# ℹ abbreviated name: ¹​`subtropical depression`
# ℹ 4 more variables: `subtropical storm` <int>, `tropical depression` <int>,
#   `tropical storm` <int>, `tropical wave` <int>

Transform the dataset into long format using pivot_longer().

storms_wider %>%
  pivot_longer(cols = disturbance:`tropical wave`, names_to = "status", values_to = "max_wind_speed") %>%
  drop_na()
# A tibble: 73 × 3
    year status                 max_wind_speed
   <dbl> <chr>                           <int>
 1  2013 disturbance                        35
 2  2013 extratropical                      45
 3  2013 hurricane                          80
 4  2013 other low                          50
 5  2013 subtropical depression             30
 6  2013 subtropical storm                  55
 7  2013 tropical depression                30
 8  2013 tropical storm                     60
 9  2014 disturbance                        25
10  2014 extratropical                      65
# ℹ 63 more rows
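
The round trip between the two formats can also be sketched on a toy table, which makes it easier to see where the NA values come from. The long object is hypothetical; values_drop_na = TRUE is used here as an alternative to calling drop_na() afterwards.

```r
library(tidyverse)

# Hypothetical toy data in long format
long <- tribble(
  ~year, ~status,          ~max_wind,
  2013,  "hurricane",       80,
  2013,  "tropical storm",  60,
  2014,  "hurricane",      125)

# Long to wide: 2014 has no "tropical storm" row, so that cell becomes NA
wide <- long %>%
  pivot_wider(names_from = status, values_from = max_wind)

# Wide to long again, dropping the NA cells created by the widening step
back <- wide %>%
  pivot_longer(cols = -year, names_to = "status", values_to = "max_wind",
               values_drop_na = TRUE)

back
```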

tidyr

It is possible to do the same thing with base R and stats functions instead of the tidyverse. However, as mentioned before, the code is less straightforward and more difficult to follow (especially for beginners).

library(tidyverse) # to get the storms dataset

# Filter the dataset for years 2013 and onwards
storms_filtered <- subset(storms, year >= 2013)

# Select only the year, status, and wind columns
storms_selected <- storms_filtered[, c("year", "status", "wind")]

# Aggregate to find the maximum wind speed for each year and status combination
storms_aggregated <- aggregate(wind ~ year + status, data = storms_selected, max)

# Renaming the aggregated column
names(storms_aggregated)[which(names(storms_aggregated) == "wind")] <- "max_wind_speed"

# Reshape the data from long to wide format
storms_wider <- reshape(storms_aggregated, timevar = "status", idvar = "year", direction = "wide")

# Rename the columns to match the status names recorded in the storms dataset
# (the dot is escaped so it matches a literal "." rather than any character)
colnames(storms_wider) <- sub("^max_wind_speed\\.", "", colnames(storms_wider))

storms_wider # View the result
   year disturbance extratropical hurricane other low subtropical depression
1  2013          35            45        80        50                     30
2  2014          25            65       125        40                     NA
3  2017          40            75       155        45                     30
4  2018          45            75       140        40                     30
5  2019          50            75       160        80                     NA
6  2020          40            75       135        45                     30
7  2021          40            70       135        40                     NA
8  2022          40           100       140        55                     NA
11 2015          NA            65       135        55                     NA
12 2016          NA            70       145        45                     NA
   subtropical storm tropical depression tropical storm tropical wave
1                 55                  30             60            NA
2                 50                  30             60            NA
3                 NA                  30             60            30
4                 50                  30             60            40
5                 55                  30             60            NA
6                 60                  30             60            NA
7                 50                  30             60            NA
8                 45                  30             60            NA
11                50                  30             60            NA
12                55                  30             60            NA

tidyr

The reverse transformation, from wide to long, can also be done outside the tidyverse, here with melt() from the reshape2 package. Again, the code is less straightforward and more difficult to follow (especially for beginners).

library(reshape2)
# Reshape from wide to long format using melt
storms_longer <- melt(storms_wider, id.vars = "year", 
                      measure.vars = names(storms_wider)[names(storms_wider) != "year"],
                      variable.name = "status", value.name = "max_wind_speed")

# Drop rows with NA in 'max_wind_speed'
storms_longer <- storms_longer[!is.na(storms_longer$max_wind_speed), ]

# Strip any 'max_wind_speed.' prefix left in 'status' (keeps only the text after the last dot) and convert it to character
storms_longer$status <- sub(".*\\.", "", storms_longer$status)

# Order the rows by 'year'
storms_longer <- storms_longer[order(storms_longer$year), ]

storms_longer[1:10, ] # View the result of the first 10 rows
   year                 status max_wind_speed
1  2013            disturbance             35
11 2013          extratropical             45
21 2013              hurricane             80
31 2013              other low             50
41 2013 subtropical depression             30
51 2013      subtropical storm             55
61 2013    tropical depression             30
71 2013         tropical storm             60
2  2014            disturbance             25
12 2014          extratropical             65

tidyr

Using unite() to combine year, month and day into a single column named date.

library(tidyverse)
storms %>%
  unite("date", year, month, day, sep = "-")
# A tibble: 19,537 × 13
   name  date       hour   lat  long status              category  wind pressure
   <chr> <chr>     <dbl> <dbl> <dbl> <fct>                  <dbl> <int>    <int>
 1 Amy   1975-6-27     0  27.5 -79   tropical depression       NA    25     1013
 2 Amy   1975-6-27     6  28.5 -79   tropical depression       NA    25     1013
 3 Amy   1975-6-27    12  29.5 -79   tropical depression       NA    25     1013
 4 Amy   1975-6-27    18  30.5 -79   tropical depression       NA    25     1013
 5 Amy   1975-6-28     0  31.5 -78.8 tropical depression       NA    25     1012
 6 Amy   1975-6-28     6  32.4 -78.7 tropical depression       NA    25     1012
 7 Amy   1975-6-28    12  33.3 -78   tropical depression       NA    25     1011
 8 Amy   1975-6-28    18  34   -77   tropical depression       NA    30     1006
 9 Amy   1975-6-29     0  34.4 -75.8 tropical storm            NA    35     1004
10 Amy   1975-6-29     6  34   -74.8 tropical storm            NA    40     1002
# ℹ 19,527 more rows
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>
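
Note that unite() returns a character column. If a proper Date is needed, one option is to parse the united string with base R's as.Date(), as in this minimal sketch on a hypothetical toy tibble.

```r
library(tidyverse)

# Hypothetical toy data with the same three date columns as storms
toy <- tibble(year = c(1975, 1975), month = c(6, 6), day = c(27, 28))

toy_dates <- toy %>%
  unite("date", year, month, day, sep = "-") %>%    # character, e.g. "1975-6-27"
  mutate(date = as.Date(date, format = "%Y-%m-%d")) # parsed into class Date

toy_dates
```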

Using separate_wider_delim() to split the date column back into three columns.

storms %>%
  unite("date", year, month, day, sep = "-") %>% 
  separate_wider_delim(cols = date, delim = "-", names = c("year", "month", "day")) %>% 
  mutate(category = replace_na(category, 0)) %>% 
  filter(category == 0)
# A tibble: 14,734 × 15
   name  year  month day    hour   lat  long status      category  wind pressure
   <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
 1 Amy   1975  6     27        0  27.5 -79   tropical d…        0    25     1013
 2 Amy   1975  6     27        6  28.5 -79   tropical d…        0    25     1013
 3 Amy   1975  6     27       12  29.5 -79   tropical d…        0    25     1013
 4 Amy   1975  6     27       18  30.5 -79   tropical d…        0    25     1013
 5 Amy   1975  6     28        0  31.5 -78.8 tropical d…        0    25     1012
 6 Amy   1975  6     28        6  32.4 -78.7 tropical d…        0    25     1012
 7 Amy   1975  6     28       12  33.3 -78   tropical d…        0    25     1011
 8 Amy   1975  6     28       18  34   -77   tropical d…        0    30     1006
 9 Amy   1975  6     29        0  34.4 -75.8 tropical s…        0    35     1004
10 Amy   1975  6     29        6  34   -74.8 tropical s…        0    40     1002
# ℹ 14,724 more rows
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>
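
As the output above shows, the columns created by separate_wider_delim() are character vectors. A minimal sketch of converting them back to integers with mutate() and across(), using a hypothetical toy tibble:

```r
library(tidyverse)

# Hypothetical toy data holding dates as delimited strings
dates <- tibble(date = c("1975-6-27", "1975-6-28"))

parts <- dates %>%
  separate_wider_delim(cols = date, delim = "-", names = c("year", "month", "day")) %>%
  mutate(across(c(year, month, day), as.integer)) # character to integer

parts
```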

tidyr

The functionality of unite() can also be replicated using base R functions.

library(tidyverse) # to load the storms dataset

# Create 'date' column by concatenating 'year', 'month', and 'day'
storms$date <- with(storms, paste(year, month, day, sep = "-"))

# Remove the 'year', 'month', and 'day' columns
storms <- storms[, !(names(storms) %in% c("year", "month", "day"))]

# Reorder columns to place 'date' after 'name'
cols_order <- c("name", "date", setdiff(names(storms), c("name", "date")))
storms <- storms[, cols_order]

storms[1:10, ] # viewing the first 10 rows of the data
# A tibble: 10 × 13
   name  date       hour   lat  long status              category  wind pressure
   <chr> <chr>     <dbl> <dbl> <dbl> <fct>                  <dbl> <int>    <int>
 1 Amy   1975-6-27     0  27.5 -79   tropical depression       NA    25     1013
 2 Amy   1975-6-27     6  28.5 -79   tropical depression       NA    25     1013
 3 Amy   1975-6-27    12  29.5 -79   tropical depression       NA    25     1013
 4 Amy   1975-6-27    18  30.5 -79   tropical depression       NA    25     1013
 5 Amy   1975-6-28     0  31.5 -78.8 tropical depression       NA    25     1012
 6 Amy   1975-6-28     6  32.4 -78.7 tropical depression       NA    25     1012
 7 Amy   1975-6-28    12  33.3 -78   tropical depression       NA    25     1011
 8 Amy   1975-6-28    18  34   -77   tropical depression       NA    30     1006
 9 Amy   1975-6-29     0  34.4 -75.8 tropical storm            NA    35     1004
10 Amy   1975-6-29     6  34   -74.8 tropical storm            NA    40     1002
# ℹ 4 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>, wind_category <chr>, major_hurricane <chr>

The functionality of separate_wider_delim() can also be replicated with base R functions.

rm(storms) #cleaning the modifications done previously in this object
library(tidyverse) # to access the storms dataset 

# Create 'date' column by concatenating 'year', 'month', and 'day'
storms$date <- with(storms, paste(year, month, day, sep = "-"))

# Split 'date' column into 'year', 'month', and 'day' columns
date_parts <- do.call(rbind, strsplit(storms$date, "-"))
storms$year <- date_parts[, 1]
storms$month <- date_parts[, 2]
storms$day <- date_parts[, 3]

# Replace NA values in 'category' with 0
storms$category[is.na(storms$category)] <- 0

# Filter rows where 'category' is 0
storms_filtered <- storms[storms$category == 0, ]

storms_filtered[1:10, ] # View the result of the first 10 rows
# A tibble: 10 × 14
   name  year  month day    hour   lat  long status      category  wind pressure
   <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
 1 Amy   1975  6     27        0  27.5 -79   tropical d…        0    25     1013
 2 Amy   1975  6     27        6  28.5 -79   tropical d…        0    25     1013
 3 Amy   1975  6     27       12  29.5 -79   tropical d…        0    25     1013
 4 Amy   1975  6     27       18  30.5 -79   tropical d…        0    25     1013
 5 Amy   1975  6     28        0  31.5 -78.8 tropical d…        0    25     1012
 6 Amy   1975  6     28        6  32.4 -78.7 tropical d…        0    25     1012
 7 Amy   1975  6     28       12  33.3 -78   tropical d…        0    25     1011
 8 Amy   1975  6     28       18  34   -77   tropical d…        0    30     1006
 9 Amy   1975  6     29        0  34.4 -75.8 tropical s…        0    35     1004
10 Amy   1975  6     29        6  34   -74.8 tropical s…        0    40     1002
# ℹ 3 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>, date <chr>

base R vs. tidyverse

base R

  • Syntax: More traditional, can be less intuitive for beginners.
  • Data Handling: Works well with base data structures like vectors, matrices, arrays, and data frames.
  • Data Manipulation: Requires more lines of code for complex operations.
  • Package Ecosystem: Functions spread across various packages.

tidyverse

  • Syntax: Modern, more consistent, and often more readable.
  • Data Handling: Centered around the tibble, a modern take on the data.frame.
  • Data Manipulation: Simplified with dplyr, enabling complex operations with fewer lines of code.
  • Package Ecosystem: Integrated suite of packages designed to work together seamlessly.
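
To make the comparison concrete, here is a minimal side-by-side sketch of the same task (mean wind speed per year after 2009) in both styles. The object names storms_raw, recent, base_result and tidy_result are illustrative; dplyr::storms is referenced explicitly so that earlier in-place changes to storms do not interfere.

```r
library(tidyverse)

# Use the packaged copy of the dataset
storms_raw <- dplyr::storms

# base R: subset with logical indexing, then aggregate with a formula
recent <- storms_raw[storms_raw$year > 2009, ]
base_result <- aggregate(wind ~ year, data = recent, FUN = mean)

# tidyverse: the same steps chained with the pipe
tidy_result <- storms_raw %>%
  filter(year > 2009) %>%
  group_by(year) %>%
  summarise(wind = mean(wind))

# Both approaches should return the same yearly means
all.equal(base_result$wind, tidy_result$wind)
```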

The remaining tidyverse core packages

The remaining tidyverse core packages

Bear in mind that the remaining core packages are as useful and powerful as the three (tibble, dplyr and tidyr) covered in this lecture. Mastering them helps us work more efficiently with the tidyverse suite of packages and build more useful, tidy and clear workflows.

The remaining tidyverse core packages

The same syntax applies to the remaining tidyverse core packages. It is therefore possible, and recommended, to build modular code; this even extends to plotting with ggplot2.
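
One way to read "modular" here is to wrap each step of a pipeline in a small function, so that processing and plotting can be reused and tested separately. The sketch below assumes this interpretation; mean_wind_by_year() and plot_mean_wind() are hypothetical helper names, not tidyverse functions.

```r
library(tidyverse)

# Hypothetical helper: summarise mean wind speed per year from a given year onwards
mean_wind_by_year <- function(data, from_year) {
  data %>%
    filter(year >= from_year) %>%
    group_by(year) %>%
    summarise(mean_wind_speed = mean(wind, na.rm = TRUE), .groups = "drop")
}

# Hypothetical helper: plot the summary table produced above
plot_mean_wind <- function(summary_tbl) {
  ggplot(summary_tbl, aes(x = year, y = mean_wind_speed)) +
    geom_line() +
    theme_bw()
}

# The two modules plug into each other
wind_summary <- mean_wind_by_year(dplyr::storms, from_year = 2010)
wind_plot <- plot_mean_wind(wind_summary)
```

Because each function takes a data frame and returns a single object, either step can be swapped out (for example a different plot type) without touching the other.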

The remaining tidyverse core packages

  • A very quick example of chaining data processing and plotting in a single pipeline

# Load the tidyverse package for data manipulation and plotting
library(tidyverse)

storms %>% 
  filter(year > 2009) %>%  # Filter the data to include only years greater than 2009
  group_by(year) %>% # Group the filtered data by 'year'
  summarise(mean_wind_speed = mean(wind, na.rm = TRUE)) %>% # Calculate the mean wind speed for each year, ignoring NA values
  ggplot(aes(x = year, y = mean_wind_speed)) + # Initialize a ggplot object, mapping 'year' to x-axis and 'mean_wind_speed' to y-axis
  geom_col(fill = "dodgerblue2", col = "black") + # Add a column plot (bar plot) to the ggplot object with specific color and border
  coord_flip() + # Flip the coordinates to make the bars horizontal
  labs(title = "Storms", x = "Year", y = "Average wind speed (knots)") + # Add a title and labels to the x-axis and y-axis
  theme_bw() +  # Apply a black-and-white theme for a cleaner look
  theme(text = element_text(size = 14)) # Customize the text size for all text elements in the plot

Additional resources

RStudio IDE cheat sheet

Core packages cheat sheets

Additional Learning Resources